(57) ABSTRACT

Low-cost digital interfaces for home networks are introduced
that integrate entertainment, communication and computing
electronics into consumer multimedia. Normally, these are
low-cost, easy-to-use systems, since they allow the user to
remove or add any kind of network device while the bus is
active. To improve the user interface, a speech unit (2) is
proposed that enables all devices (11) connected to the bus
system (31) to be controlled by a single speech recognition
device. The properties of this device, e.g. the vocabulary,
can be dynamically and actively extended by the consumer
devices (11) connected to the bus system (31). The proposed
technology is independent of a specific bus standard, e.g.
the IEEE 1394 standard, and is well-suited for all kinds of
wired or wireless home networks. The speech unit (2) receives
data and messages from the devices. The speech unit (2)
recognizes speaker-dependent commands. A speech synthesizer
synthesizes messages. A remotely controllable device (11)
has access to a medium, which may be a CD-ROM. The device
may ask for a logical name or identifier.

SPEECH RECOGNITION CONTROL OF
REMOTELY CONTROLLABLE DEVICES IN 
A HOME NETWORK ENVIRONMENT 

DESCRIPTION 

This invention relates to a speech interface in a home 
network environment. In particular, it is concerned with a 
speech recognition device, a remotely controllable device 
and a method of self-initialization of a speech recognition
device. 

Generally, speech recognizers are known for controlling
different consumer devices, e.g. television, radio, car
navigation, mobile telephone, camcorder, PC, printer, or the
heating of buildings or rooms. Each of these speech
recognizers is built into a specific device to control it. The
properties of such a recognizer, such as the vocabulary, the
grammar and the corresponding commands, are designed for this
particular task.

On the other hand, technology is now available to connect
several of the above-listed consumer devices via a home
network with dedicated bus systems, e.g. an IEEE 1394 bus.
Devices adapted for such systems communicate by sending
commands and data to each other. Usually such devices
identify themselves when they are connected to the network
and get a unique address assigned by a network controller.
Thereafter, these addresses can be used by all devices to
communicate with each other. All other devices already
connected to such a network are informed about the address
and type of a newly connected device. Such a network will be
included in private homes as well as cars.

Speech recognition devices enhance comfort and, if used
in a car, may improve safety, as the operation of consumer
devices, e.g. controlling a car stereo, becomes more and more
complicated. In a home network environment, too, tasks such
as programming a video recorder or selecting television
channels can be simplified by using a speech recognizer. On
the other hand, speech recognition devices have a rather
complicated structure and require quite expensive technology
if reliable and flexible operation is to be ensured.
Therefore, a speech recognizer will not be affordable for
most of the devices listed above.

Therefore, it is the object of the present invention to 
provide a generic speech recognizer facilitating the control 
of several devices. Further, it is the object of the present 
invention to provide a remotely controllable device that 
simplifies its network-controllability via speech. 

A further object is to provide a method of
self-initialization of the task-dependent parts of such a
speech recognition device to control such remotely
controllable devices.

These objects are respectively achieved as defined in the 
independent claims 1, 4, 14, 15 and 18. 

Further preferred embodiments of the invention are 
defined in the respective subclaims. 

The present invention will become apparent and its 
numerous modifications and advantages will be better 
understood from the following detailed description of an 
embodiment of the invention taken in conjunction with the 
accompanying drawings, wherein 

FIG. 1 shows a block diagram of an example of a speech 
unit according to an embodiment of the invention; 

FIG. 2 shows a block diagram of an example of a network 
device according to an embodiment of the invention; 

FIG. 3 shows an example of a wired 1394 network having
a speech unit and several 1394 devices;

FIG. 4 shows an example of a wired 1394 network having
a speech unit incorporated in a 1394 device and several
normal 1394 devices;

FIG. 5 shows three examples of different types of
networks;

FIG. 6 shows an example of a home network in a house 
having three clusters; 

FIGS. 7(a) and 7(b) show two examples of controlling a
network device remotely via a speech recognizer;

FIG. 8 shows an example of a part of a grammar for a user 
dialogue during a VCR programming; 

FIG. 9 shows an example of a protocol of the interaction 
between a user, a speech recognizer and a network device; 

FIG. 10 shows an example of a learning procedure of a
connected device, where the name of the device is deter- 
mined automatically; 

FIG. 11 shows an example of a protocol of a notification
procedure of a device being newly connected, where the user
is asked for the name of the device;

FIG. 12 shows an example of a protocol of the interaction 
of multiple devices for vocabulary extensions concerning 
media contents; and 

FIG. 13 shows another example of a protocol of the 
interaction of multiple devices for vocabulary extensions 
concerning media contents. 

FIG. 1 shows a block diagram of an example of the
structure of a speech unit 2 according to the invention. Said
speech unit 2 is connected to a microphone 1 and a
loudspeaker, which could also be built into said speech unit
2. The speech unit 2 comprises a speech synthesizer, a
dialogue module, a speech recognizer and a speech
interpreter and is connected to an IEEE 1394 bus system 10.
It is also possible that the microphone 1 and/or the
loudspeaker are connected to the speech unit 2 via said bus
system 10. Of course it is then necessary that the microphone
1 and/or the loudspeaker are respectively equipped with
circuitry to communicate with said speech unit 2 via said
network, such as A/D and D/A converters and/or command
interpreters, so that the microphone 1 can transmit the
electric signals corresponding to received spoken utterances
to the speech unit 2 and the loudspeaker can output received
electric signals from the speech unit 2 as sound.

IEEE 1394 is an international standard, low-cost digital
interface that will integrate entertainment, communication
and computing electronics into consumer multimedia. It is a
low-cost, easy-to-use bus system, since it allows the user to
remove or add any kind of 1394 device while the bus is
active. Although the present invention is described in
connection with such an IEEE 1394 bus system and IEEE 1394
network devices, the proposed technology is independent
of the specific IEEE 1394 standard and is well-suited for
all kinds of wired or wireless home networks or other
networks.

As will be shown in detail later, a speech unit 2, as shown
in FIG. 1, is connected to the home network 10. This is a
general purpose speech recognizer and synthesizer having a
generic vocabulary. The same speech unit 2 is used for
controlling all of the devices 11 connected to the network 10.
The speech unit 2 picks up a spoken-command from a user
via the microphone 1, recognizes it and converts it into a
corresponding home network control code, henceforth
called user-network-command, e.g. specified by the IEEE
1394 standard. This control code is then sent to the
appropriate device that performs the action associated with
the user-network-command.

To be capable of enabling all connected network devices
to be controlled by speech, the speech unit has to "know" the
commands that are needed to provide operability of all
individual devices 11. Initially, the speech unit "knows" a
basic set of commands, e.g., commands that are the same for
various devices. There can be a many-to-one mapping
between spoken-commands from a user and
user-network-commands generated therefrom. Such
spoken-commands can e.g. be play, search for a radio station,
or numbers such as phone numbers. These commands can be
spoken in isolation or can be explicitly or implicitly
embedded within full sentences. Full sentences will
henceforth as well be called spoken-command.

In general, speech recognizers and technologies for
speech recognition, interpretation, and dialogues are
well-known and will not be explained in detail in connection
with this invention. Basically, a speech recognizer comprises
a set of vocabulary and a set of knowledge-bases (henceforth
grammars) according to which a spoken-command from a
user is converted into a user-network-command that can be
carried out by a device. The speech recognizer also may use
a set of alternative pronunciations associated with each
vocabulary word. The dialogue with the user will be
conducted according to some dialogue model.

The speech unit 2 according to an embodiment of the
invention comprises a digital signal processor 3 connected to
the microphone 1. The digital signal processor 3 receives the
electric signals corresponding to the spoken-command from
the microphone 1 and performs a first processing to convert
these electric signals into digital words recognizable by a
central processing unit 4. To be able to perform this first
processing, the digital signal processor 3 is bidirectionally
coupled to a memory 8 holding information about the
process to be carried out by the digital signal processor 3 and
a speech recognition section 3a included therein. Further, the
digital signal processor 3 is connected to a feature extraction
section 7e of a memory 7 wherein information is stored of
how to convert electric signals corresponding to
spoken-commands into digital words corresponding thereto. In
other words, the digital signal processor 3 converts the
spoken-command from a user input via the microphone 1 into a
computer recognizable form, e.g. text code.

The digital signal processor 3 sends the generated digital
words to the central processing unit 4. The central
processing unit 4 converts these digital words into
user-network-commands sent to the home network system 10.
Therefore, the digital signal processor 3 and the central
processing unit 4 can be seen as speech recognizer, dialogue
module and speech interpreter.

It is also possible that the digital signal processor 3 only
performs a spectrum analysis of the spoken-command from
a user input via the microphone 1 and the word recognition
itself is conducted in the central processing unit 4 together
with the conversion into user-network-commands. Depending
on the capacity of the central processing unit 4, it can
also perform the spectrum analysis and the digital signal
processor 3 can be omitted.

Further, the central processing unit 4 provides a learning
function for the speech unit 2 so that the speech unit 2 can
learn new vocabulary, grammar and user-network-commands
to be sent to a network device 11 corresponding
thereto. To be able to perform these tasks the central
processing unit 4 is bidirectionally coupled to the memory 8
that is also holding information about the processes to be
performed by the central processing unit 4. Further, the
central processing unit 4 is bidirectionally coupled to an
initial vocabulary section 7a, an extended vocabulary
section 7b, an initial grammar section 7c, an extended grammar
section 7d and a software section 7f that comprises a
recognition section and a grapheme-phoneme conversion
section of the memory 7. Further, the central processing unit
4 is bidirectionally coupled to the home network system 10
and can also send messages to a digital signal processor 9
included in the speech unit 2 comprising a speech generation
section 9a that serves to synthesize messages into speech
and outputs this speech to a loudspeaker.

The central processing unit 4 is bidirectionally coupled to
the home network 10 via a link layer control unit 5 and an
interface (I/F) physical layer unit 6. These units serve to
filter out network-commands from bus 10 directed to the
speech unit 2 and to address network-commands to selected
devices connected to the network 10.

Therefore, it is also possible that new
user-network-commands together with corresponding vocabulary
and grammars can be learned by the speech unit 2 directly from
other network devices. To perform such a learning, the
speech unit 2 can send control commands stored in the
memory 8 to control the network devices, henceforth called
control-network-commands, to request their
user-network-commands and corresponding vocabulary and
grammars according to which they can be controlled by a user.
The memory 7 comprises an extended vocabulary section 7b and
an extended grammar section 7d to store newly input
vocabulary or grammars. These sections are respectively
designed like the initial vocabulary section 7a and the initial
grammar section 7c, but newly input user-network-commands
together with information needed to identify
these user-network-commands can be stored in the extended
vocabulary section 7b and the extended grammar section 7d
by the central processing unit 4. In this way, the speech unit
2 can learn user-network-commands and corresponding
vocabulary and grammars built into an arbitrary network
device. New network devices have then no need to have a
built-in speech recognition device, but only the
user-network-commands and corresponding vocabulary and
grammars that should be controllable via a speech
recognition system. Further, there has to be a facility to
transfer these data to the speech unit 2. The speech unit 2
according to the invention learns said user-network-commands
and corresponding vocabulary and grammar and the respective
device can be voice-controlled via the speech unit 2.

The initial vocabulary section 7a and the initial grammar
section 7c store a basic set of user-network-commands that
can be used for various devices, like user-network-commands
corresponding to the spoken-commands switch on, switch off,
pause, louder, etc. These user-network-commands are stored in
connection with vocabulary and grammars needed by the central
processing unit 4 to identify them out of the digital words
produced by the speech recognition section via the digital
signal processor 3.

Further, questions or messages are stored in a memory.
These can be output from the speech unit 2 to a user. Such
questions or messages may be used in a dialogue in-between
the speech unit 2 and the user to complete commands spoken
by the user into proper user-network-commands. Examples
are please repeat, which device, do you really want to switch
on?, etc. All such messages or questions are stored together
with speech data needed by the central processing unit 4 to
generate digital words to be output to the speech generation
and synthesis section 9a of the digital signal processor 9 to
generate spoken utterances output to the user via the
loudspeaker. Through the microphone 1, the digital signal
processors 3 and 9 and the loudspeaker a "bidirectional
coupling" of the central processing unit 4 with a user is
possible. Therefore, it is possible that the speech unit 2
can communicate with a user and learn from him or her. Like
in the case of the communication with a network device 11,
the speech unit 2 can access a set of control-network-commands
stored in the memory 8 to instruct the user to give certain
information to the speech unit 2.
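
The recognition path described above, in which recognized
words are matched against an initial and an extended
vocabulary section and several spoken-commands may map onto
one user-network-command, can be illustrated by a minimal
Python sketch. The class name, the command codes and the
example phrases are hypothetical and chosen only for
illustration; the embodiment does not prescribe any of them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CommandMemory:
    """Hypothetical stand-in for the vocabulary sections of memory 7."""
    initial: Dict[str, str] = field(default_factory=dict)    # like section 7a
    extended: Dict[str, str] = field(default_factory=dict)   # like section 7b

    def learn(self, phrase: str, command: str) -> None:
        """Store a newly learned phrase in the extended section."""
        self.extended[phrase.lower()] = command

    def interpret(self, phrase: str) -> Optional[str]:
        """Map a recognized phrase to a user-network-command, if known."""
        key = phrase.lower()
        if key in self.initial:
            return self.initial[key]
        return self.extended.get(key)


# Assumed command codes; a real bus standard would define its own.
memory = CommandMemory(initial={
    "switch on": "CMD_POWER_ON",
    "turn on": "CMD_POWER_ON",      # many-to-one: two phrases, one command
    "switch off": "CMD_POWER_OFF",
    "louder": "CMD_VOLUME_UP",
})
memory.learn("rewind", "CMD_REWIND")  # learned later, kept separately
```

Keeping the learned entries apart from the initial ones mirrors
the split between initial and extended sections, so that the
basic command set stays intact whatever is learned afterwards.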

As stated above, user-network-commands and the
corresponding vocabulary and grammars can also be input by a
user via the microphone 1 and the digital signal processor 3
to the central processing unit 4, on demand of
control-network-commands output as messages by the speech
unit 2 to the user. After the user has uttered a
spoken-command to set the speech unit 2 into a learning
state, the central processing unit 4 performs a dialogue with
the user on the basis of control-network-commands stored in
the memory 8 to generate new user-network-commands and
corresponding vocabulary to be stored in the respective
sections of the memory 7.

It is also possible that the process of learning new
user-network-commands is done half-automatically by the
communication in-between the speech unit 2 and an
arbitrary network device and half dialogue-controlled between
the speech unit 2 and a user. In this way, user-dependent
user-network-commands for selected network devices can
be generated.

As stated above, the speech unit 2 processes three kinds
of commands: spoken-commands uttered by a user;
user-network-commands, i.e. digital signals corresponding
to the spoken-commands; and control-network-commands,
which serve to communicate with other devices or with a
user in order to learn new user-network-commands from other
devices 11 and to assign certain functionalities thereto, so
that a user can input new spoken-commands, or to assign a
new functionality to user-network-commands already
included.

Outputs of the speech unit directed to the user are either
synthesized speech or pre-recorded utterances. A mixture of
both might be useful, e.g. pre-recorded utterances for the
most frequent messages and synthesized speech for other
messages. Any network device can send messages to the
speech unit. These messages are either directly in
orthographic form or they encode or identify in some way an
orthographic message. Then these orthographic messages
are output via a loudspeaker, e.g. included in the speech unit
2. Messages can contain all kinds of information usually
presented on a display of a consumer device. Furthermore,
there can be questions put forward to the user in the course
of a dialogue. As stated above, such a dialogue can also be
produced by the speech unit 2 itself to verify or confirm
spoken-commands or it can be generated by the speech unit
2 according to control-network-commands to learn new
user-network-commands and corresponding vocabulary and
grammars.
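
The mixture of pre-recorded utterances and synthesized speech
mentioned above, together with messages that arrive either in
orthographic form or as an identifier of an orthographic
message, can be sketched as follows. The two tables, the
identifiers and the audio placeholders are assumptions made
for this illustration only.

```python
from typing import Dict, Tuple, Union

# Hypothetical pre-recorded utterances for the most frequent messages.
PRERECORDED: Dict[str, bytes] = {
    "please repeat": b"<audio:please-repeat>",
    "which device": b"<audio:which-device>",
}

# Hypothetical identifiers that stand for an orthographic message.
MESSAGE_IDS: Dict[int, str] = {
    1: "please repeat",
    2: "recording finished",
}


def resolve_message(message: Union[int, str]) -> str:
    """Return the orthographic form of a message sent by a device."""
    if isinstance(message, int):
        return MESSAGE_IDS[message]   # message was sent as an identifier
    return message                    # message was sent as plain text


def render(message: Union[int, str]) -> Tuple[str, Union[bytes, str]]:
    """Pick a pre-recorded utterance if one exists, else ask for synthesis."""
    text = resolve_message(message)
    if text in PRERECORDED:
        return ("prerecorded", PRERECORDED[text])
    return ("synthesize", text)
```

Under these assumptions a frequent message is played back from
storage, while any other text is handed to the synthesizer.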

The speech input and/or output facility, i.e. the micro- 
phone 1 and the loudspeaker, can also be one or more 
separate device(s). In this case messages can be communi- 
cated in orthographic form in-between the speech unit and 
the respective speech input and/or output facility. 

A spoken message sent from the speech unit 2 to the user,
like which device should be switched on?, could also be
answered by the user with a question put back to the speech
unit 2, e.g. which network device do you know?. The speech
unit 2 could then first answer this question via speech,
before the user answers the initial spoken message sent from
the speech unit.

FIG. 2 shows a block diagram of an example of the
structure of remotely controllable devices according to an
embodiment of this invention, here a network device 11.
This block diagram shows only those function blocks
necessary for the speech controllability. A central processing
unit 12 of such a network device 11 is connected via a link
layer control unit 17 and an I/F physical layer unit 16 to the
home network bus 10. Like in the speech unit 2, the
connection in-between the central processing unit 12 and the
home network bus 10 is bidirectional so that the central
processing unit 12 can receive user-network-commands and
control-network-commands and other information data from
the bus 10 and send control-network-commands, messages
and other information data to other network devices or a
speech unit 2 via the bus 10. Depending on the device, it
might also be possible that it will also send
user-network-commands. The central processing unit 12 is
bidirectionally coupled to a memory 14 where all information
necessary for the processing of the central processing unit
12, including a list of control-network-commands needed to
communicate with other network devices, is stored. Further,
the central processing unit 12 is bidirectionally coupled to a
device control unit 15 controlling the overall processing of
the network device 11. A memory 13 holding all
user-network-commands to control the network device 11 and the
corresponding vocabulary and grammars is also bidirectionally
coupled to the central processing unit 12. These
user-network-commands and corresponding vocabularies and
grammars stored in the memory 13 can be downloaded into
the extended vocabulary section 7b and the extended grammar
section 7d of the memory 7 included in the speech unit
2 in connection with a device name for the respective network
device 11. The download path runs via the central processing
unit 12, the link layer control unit 17 and the I/F physical
layer unit 16 of the network device 11, the home network bus
system 10, the I/F physical layer unit 6 and the link layer
control unit 5 of the speech unit 2 and the central processing
unit 4 of the speech unit 2. In this way all
user-network-commands necessary to control a network device
11 and corresponding vocabulary and grammars are learned by
the speech unit 2 according to the present invention.
Therefore, a network device according to the present
invention needs no built-in device-dependent speech
recognizer to be controllable via speech, but just a memory
holding all device-dependent user-network-commands with
associated vocabulary and grammars to be downloaded into the
speech unit 2. It is to be understood that a basic control of
a network device by the speech unit 2 is also given without
vocabulary update information, i.e. the basic control of a
network device without its device-dependent
user-network-commands with associated vocabulary and grammars
is possible. Basic control means here to have the possibility
to give commands generally defined in some standard, like
switch-on, switch-off, louder, switch channel, play, stop,
etc.
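
The download of device-dependent commands into the speech
unit, and the basic control that remains possible without
such a download, can be sketched in Python as follows. The
class names, the request string REQUEST_VOCABULARY and the
command codes are hypothetical; the actual
control-network-commands would be defined by the bus standard
in use.

```python
from typing import Dict, Optional

# Assumed standard commands usable without any vocabulary update.
BASIC_COMMANDS: Dict[str, str] = {
    "switch on": "CMD_POWER_ON",
    "switch off": "CMD_POWER_OFF",
    "play": "CMD_PLAY",
    "stop": "CMD_STOP",
}


class NetworkDevice:
    """Hypothetical device holding its own vocabulary (like memory 13)."""

    def __init__(self, name: str, vocabulary: Dict[str, str]) -> None:
        self.name = name
        self._vocabulary = vocabulary

    def handle_control_command(self, command: str) -> Dict[str, str]:
        """Answer a request for the device-dependent vocabulary."""
        if command == "REQUEST_VOCABULARY":
            return dict(self._vocabulary)
        raise ValueError(f"unknown control-network-command: {command}")


class SpeechUnit:
    """Hypothetical speech unit with a per-device extended vocabulary."""

    def __init__(self) -> None:
        self._extended: Dict[str, Dict[str, str]] = {}

    def learn_from(self, device: NetworkDevice) -> None:
        """Download the vocabulary and file it under the device name."""
        self._extended[device.name] = device.handle_control_command(
            "REQUEST_VOCABULARY")

    def interpret(self, device_name: str, phrase: str) -> Optional[str]:
        """Prefer device-dependent commands, fall back to basic control."""
        learned = self._extended.get(device_name, {})
        return learned.get(phrase, BASIC_COMMANDS.get(phrase))


vcr = NetworkDevice("VCR", {"record": "CMD_RECORD", "rewind": "CMD_REWIND"})
unit = SpeechUnit()
```

Before `unit.learn_from(vcr)` is called only the basic
commands are understood for the VCR; afterwards the
device-dependent phrases are available as well.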

FIG. 3 shows an example of a network architecture having
an IEEE 1394 bus and, connected thereto, one speech unit 2
with microphone 1 and loudspeaker and four network
devices 11.

FIG. 4 shows another example of a network architecture
having four network devices 11 connected to an IEEE 1394
bus. Further, a network device 41 having a built-in speech
unit with microphone 1 and loudspeaker is connected to the
bus 31. Such a network device 41 with a built-in speech unit
has the same functionality as a network device 11 and a
speech unit 2. Here, the speech unit controls the network
devices 11 and the network device 41 in which it is built.

FIG. 5 shows three further examples of network
architectures. Network A is a network similar to that shown in
FIG. 3, but six network devices 11 are connected to the bus
31. In regard to the speech unit 2 that is also connected to
the bus 31, there is no limitation of network devices 11
controllable via said speech unit 2. Every device connected
to the bus 31 that is controllable via said bus 31 can also be
controlled via the speech unit 2.

Network B shows a different type of network. Here five
network devices 11 and one speech unit 2 are connected to
a bus system 51. The bus system 51 is organized so that a
connection is only necessary in-between two devices.
Network devices not directly connected to each other can
communicate via other third network devices. Regarding the
functionality, network B has no restrictions in comparison to
network A.

The third network shown in FIG. 5 is a wireless network.
Here, all devices can directly communicate with each other
via a transmitter and a receiver built into each device. This
example shows also that several speech units 2 can be
connected to one network. Those speech units 2 can have
either the same functionality or different functionalities,
as desired. In this way, it is also easily possible to build
personalized speech units 2 that can be carried by respective
users and that can control different network devices 11, as
desired by the user. Of course, personalized speech units can
also be used in a wired network. In comparison to a wireless
speech input and/or output facility a personalized speech
unit has the advantage that it can automatically log into
another network and all personalized features are available.

Such a personalized network device can be constructed to
translate only those spoken-commands of a selected user
into user-network-commands using speaker-adaption or
speaker-verification. This enables a very secure access
policy in that an access is only allowed if the correct speaker
uses the correct speech unit. Of course, all kinds of accesses
can be controlled in this way, e.g. access to the network
itself, access to devices connected to the network, like
access to rooms, to a VCR, to televisions and the like.

Further, electronic phone books may be stored within the
speech unit. Calling functions by name, e.g. office, is
strongly user-dependent and therefore such features will be
preferably realized in personalized speech units. Also
spoken-commands such as switch on my TV can easily be
assigned to the correct user-network-commands controlling
the correct device, as it may be the case that different users
assign different logical names therefor and the speech unit
2 has to generate the same user-network-command when
interpreting different spoken-commands. On the other hand,
it is possible that the network e.g. comprises more than one
device of the same type, e.g. two TVs, and the speech unit
2 has to generate different user-network-commands when
interpreting the same spoken-command uttered by different
users, e.g. switch on my TV.

One speech unit can contain personalized information of
one user or different users. In most cases the personalized
speech unit corresponding to only one user will be portable
and wireless, so that the user can take it with him/her and has
the same speech unit at home, in the car or even other
networks, like in his/her office.

The personalized speech unit can be used for speaker
verification purposes. It verifies the words of a speaker and
allows the control of selected devices. This can also be used
for controlling access to rooms, cars or devices, such as
phones.

A personalized speech unit can contain a speech
recognizer adapted to one person which strongly enhances the
recognition performance.

FIG. 6 shows an example of a home network consisting
of three clusters. One of the clusters is built by an IEEE 1394
bus 61 installed in a kitchen of the house. Connected to this
bus is a broadcast receiver 65, a digital television 64, a
printer 63, a phone 62 and a long distance repeater 66. This
cluster has also connections to a broadcast gateway 60 to the
outside of the house and via the repeater 66 and an IEEE
1394 bridge 74 to the cluster "sitting room", in which also an
IEEE 1394 bus 67 is present. Apart from the bridge 74, a
speech unit 70, a personal computer 69, a phone 68, a VCR
71, a camcorder 72 and a digital television 73 are connected
to the bus 67. The bridge 74 is also connected to the third
cluster "study" which comprises an IEEE 1394 bus 78
connected to the bridge 74 via a long distance repeater 75.
Further, a personal computer 76, a phone 77, a hard disc 79,
a printer 80, a digital television 81 and a telephone NIU 82
are connected to said bus 78. A telephone gateway 83 is
connected to the telephone NIU 82.

The above described network is constructed so that every
device can communicate with the other devices via the IEEE
1394 system, the bridge 74 and the repeaters 66 and 75. The
speech unit 70 located in the sitting room can communicate
with all devices and thereby has the possibility to control
them. This speech unit 70 is built like the speech unit 2
described above. Since in the example shown in FIG. 6
several devices of the same type are present, e.g., the digital
television 73 in the sitting room and the digital television 81
in the study, it is possible to define user defined device
names. When the network is set up or when a device is
connected to the network having already a device of this
type connected thereto, the speech unit 70 will ask the user
for names for these devices, e.g. television in the sitting
room and television in the study, to be assigned to the
individual devices. To be able to recognize these names, one
of the following procedures has to be done.

1. The user has to enter the orthographic form (sequence
of letters) of the device name by typing or spelling. The
speech unit 70 maps the orthographic form into a phoneme
or model sequence;

2. In the case of a personalized speech unit, the user
utterance corresponding to the device name can be
stored as a feature vector sequence that is directly used
during recognition as reference pattern in a pattern
matching approach;

3. The phoneme sequence corresponding to the name can
be learned automatically using a phoneme recognizer.

The user has then only to address these devices by name,
e.g. television in the sitting room. The speech unit 70 maps
the name to the appropriate network address. By default, the
name corresponds to the functionality of the device. All
commands uttered by a user are sent to the device named at
last. Of course it is also possible that these names are
changed later on.

In many situations a person might wish to access his
device at home over the phone, e.g. to retrieve faxes or to
control the heating remotely. Two alternative architectures to
realize such a remote access are illustrated in FIG. 7.

FIG. 7a shows that a speech unit 2 is connected to the
home network having a network device 11 and to the public
telephone network. A spoken-command from a user is
transmitted via the public telephone network to the speech
unit 2 that translates the spoken-command into a
user-network-command to control the network device 11. A user
can control his home network independently from any other
devices but the speech unit 2 from any place he likes when
he has an access to the public telephone network.

FIG. 7b shows another example in which a user having a
personalized speech unit 2 is within the reception area of an
arbitrary home network A. He utters a spoken-command into
his personalized speech unit 2 that translates said
spoken-command into a user-network-command and sends it to the
home network A. The home network A sends the generated
user-network-command via the transceivers 84 and the
public telephone network to a home network B in which the
network device 11 is located that gets controlled by the
translated spoken-command uttered by the user. Of course,
these features strongly depend on the available networks.
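
The handling of device names described above for the network
of FIG. 6 (user defined names mapped to network addresses, a
default name derived from the functionality of the device,
commands sent to the device named at last, and names that may
be changed later on) can be sketched as follows. The class,
the method names and the numeric addresses are hypothetical
and serve only as an illustration.

```python
from typing import Dict, Optional, Tuple


class DeviceDirectory:
    """Hypothetical name-to-address table kept by a speech unit."""

    def __init__(self) -> None:
        self._addresses: Dict[str, int] = {}
        self._last_named: Optional[str] = None

    def register(self, address: int, device_type: str,
                 user_name: Optional[str] = None) -> str:
        """Register a device; the name defaults to its functionality."""
        name = (user_name or device_type).lower()
        if name in self._addresses:
            # A device of this type exists already: the user must be asked.
            raise ValueError(f"name already in use: {name}")
        self._addresses[name] = address
        return name

    def rename(self, old: str, new: str) -> None:
        """Change a device name later on, keeping its address."""
        self._addresses[new.lower()] = self._addresses.pop(old.lower())
        if self._last_named == old.lower():
            self._last_named = new.lower()

    def route(self, command: str,
              device_name: Optional[str] = None) -> Tuple[int, str]:
        """Address the named device, or else the device named at last."""
        if device_name is not None:
            key = device_name.lower()
            if key not in self._addresses:
                raise KeyError(f"unknown device name: {device_name}")
            self._last_named = key
        if self._last_named is None:
            raise LookupError("no device has been named yet")
        return self._addresses[self._last_named], command


directory = DeviceDirectory()
# Addresses 0x10 and 0x11 are made up for this example.
directory.register(0x10, "television", "television in the sitting room")
directory.register(0x11, "television", "television in the study")
```

A command given without a device name, e.g.
`directory.route("CMD_VOLUME_UP")`, is then sent to whichever
device the user addressed most recently.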

As described above, the speech unit 2 has a speech output
facility either included or connected thereto directly or via a
network so that messages from network devices can be
synthesized into uttered sequences and a dialogue
in-between the speech unit 2 and the user can be performed.
Such a dialogue would for example also be useful for the
programming of a VCR. The messages can also provide
additional information about the status of the network
device, e.g. the titles of the CDs contained in a juke box. In
general, the number and type of message is not fixed.

FIG. 8 shows examples for a part of a grammar for a user
dialogue during VCR programming. "S" are system questions;
"U" denotes spoken-commands or other user utterances.
Possible spoken-commands or user utterances at each
dialogue step are defined by the word grammar and the
vocabularies. Grammars, e.g. finite state transition
grammars, are used to restrict the set of word sequences to
be recognized or to specify a sequence of dialogue steps, e.g.
needed to program a video recorder. A different finite state
grammar may be specified to each dialogue step. These
grammars are directly used by the speech unit. On the other
hand, these grammars are entirely device-dependent.
Therefore, it is not practical to have static finite state
grammars in the speech unit. It is rather proposed that a
device newly connected to the network can send its specific
set of grammars to the speech unit.

As it is shown in the above part of FIG. 8, a dialogue
grammar could be that in a step S1 the system asks the user
for a channel, i.e. outputs a message channel? to the user. In
a following step S2 the user inputs a word sequence
U_CHANNEL to the system as spoken-command.
Thereafter, in step S3 the system asks if the action to be
programmed should be taken today. In the following step S4
the user inputs a word sequence U_Y/N_DATE to the
system, telling yes, no or the date at which the action to be
programmed should take place. If the date corresponds to
today or the user answers the question of the system with
yes, the system asks in a following step S5 which movie.
Thereafter, in a step S6 the user informs the system about the
film with a word sequence U_FILM. If the user has
answered no in step S4, the system asks for the date in step
S7. In the following step S8 the user inputs the date to the
system as spoken-command in a word sequence U_DATE,
thereafter the process flow continues with step S5.

In the middle of FIG. 8 examples of the grammar for word
sequences corresponding to the above example are shown.
In step S4 the user can input yes, no or a date as a word
sequence U_Y/N_DATE. Therefore, as it is shown in the
first line for the word sequence U_Y/N_DATE, the user has
the possibility to input a word sequence U_Y/N in a step
S41 or a word sequence U_DATE in a step S42. In the
second line for the word sequence U_Y/N the two
possibilities for the user input are shown, namely the word
sequence U_NO in a step S43 or the word sequence
U_YES in a step S44. The third line for the word sequence
U_DATE shows the possible word sequences for user
inputs for a date, here a sequence of two numbers is allowed
as input, a first number corresponding to a word sequence
NO_1_31 in a step S45 and a second number corresponding
to a word sequence NO_1_12 in a step S46.

The lower part of FIG. 8 shows vocabularies corresponding
to these word sequences. For example, the word
sequence U_YES can be represented by the words yes or
yeah, the word sequence U_NO can be represented by the
vocabulary no, the word sequence NO_1_31 can be
represented by the vocabularies one, two, . . . , thirty-one, first,
second, . . . , thirty-first and the word sequence NO_1_12
can be represented by the vocabularies one, . . . , twelve, first,
. . . , twelfth.

FIG. 9 shows an example of the interaction between a
user, a speech unit and a network device.

First, the user utters the spoken-command play. In the
shown case, the speech unit knows that more than one
device connected to the network can be played. It determines
that the spoken-command play does not comprise enough
information to control a specific network device. Therefore,
it outputs the message which device should be played? to the
user. The answer to this message of the user to the speech
unit is VCR. Now the speech unit determines that the user
did provide enough information to control a specific network
device as desired, here to set the VCR into the play state.
Therefore, it transmits the corresponding
user-network-command PLAY to the VCR address via the network. The
VCR receives this user-network-command and tries to perform
the associated action. In the shown case the VCR
cannot detect a cassette, therefore it cannot be set into the
play state and sends an error message to the device from
which the user-network-command PLAY was received. In
this case, an error ID X is sent to the device address of the
speech unit. The speech unit receives this error message,
recognizes it and outputs a corresponding message sorry,
there is no cassette in the VCR to the user.

The speech unit acts as an interface in-between the
network, including all network devices connected thereto,
and one or more users. The users just have to utter
spoken-commands to the speech unit connected to the network when
they are within the reception area thereof and the speech unit
that basically knows the state of the network or can generate
it, verifies if complete user-network-commands can be
generated or otherwise sends a message to the user precisely
asking for a missing part of a message or spoken-command
to be able to properly generate the corresponding
user-network-command.

The speech unit has to keep track of devices connected to
the network to eventually retrieve and understand new
functionalities. Therefore the speech unit will check all
connected devices for new speech control functionality. An
exemplary process flow is shown in FIG. 10. It also has to
keep track if devices are disconnected from the network.

First, the speech unit sends a request for the ID, including
name and device type, to the device address of a network
device connected to the network. In this state, the network
device cannot be controlled by speech. After the device has
received the request for its ID from the speech unit, it sends
its ID to the address of the speech unit. Thereafter, the
speech unit sends a request for the user-network-command
list of the device to the corresponding device address.
Having received this request, the network device sends its
user-network-command list to the speech unit, the speech
unit receives the user-network-command list, updates its
vocabulary with the vocabulary and grammars received
from the device and sends an acknowledgement receipt to
the device address of the device. The device can now be
controlled by speech. Preferably the speech unit notifies the
user that a new device providing new speech control
functionality is available after such a procedure.






If a new device is connected to the network it is possible 
that it broadcasts its ID, comprising network address, name 
and device type. FIG. 11 shows an example of such an 
initialization. Here it is shown that the device offering the 
new speech control functionality gives some kind of noti- 
fication to the speech unit, then after sending the user- 
network-command list request and receiving the user- 
network -command list the speech unit asks a user to give a 
logical name for the newly connected device. The user then 
types or spells the name of the newly connected device so 
that the speech unit can receive it. Of course, it is also 
possible that the user just utters the new name. The logical 
name given by the user is received by the speech unit that 
updates the vocabulary and grammars and sends a confir- 
mation of reception to the IEEE 1394 device that has been 
newly connected to the network. This device can now be 
controlled by speech. 
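
The initialization exchange described above (ID request, command-list request, logical name given by the user, acknowledgement) can be illustrated with a minimal Python sketch. The class names NetworkDevice and SpeechUnit, their methods and the dictionary-based vocabulary are assumptions made for illustration only; the description does not prescribe any particular data structures or message formats.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkDevice:
    """Hypothetical stand-in for a remotely controllable device."""
    address: int
    name: str
    device_type: str
    # spoken form -> user-network-command, e.g. "play" -> "PLAY"
    command_list: dict
    acknowledged: bool = False

    def request_id(self):
        # Answer to the speech unit's request for the device ID.
        return {"address": self.address, "name": self.name,
                "type": self.device_type}

    def request_command_list(self):
        # Answer to the request for the user-network-command list.
        return dict(self.command_list)

    def acknowledge(self):
        # Confirmation of reception sent back by the speech unit.
        self.acknowledged = True


@dataclass
class SpeechUnit:
    # (logical name, spoken word) -> (device address, user-network-command)
    vocabulary: dict = field(default_factory=dict)

    def initialize_device(self, device, logical_name=None):
        """Run the handshake and extend the vocabulary for one device."""
        device_id = device.request_id()
        commands = device.request_command_list()
        # Use the name supplied by the user if any, else the device's own.
        name = (logical_name or device_id["name"]).lower()
        for spoken, command in commands.items():
            key = (name, spoken.lower())
            self.vocabulary[key] = (device_id["address"], command)
        device.acknowledge()
        return name

    def translate(self, logical_name, spoken):
        """Map a spoken-command to (address, command), or None if unknown."""
        return self.vocabulary.get((logical_name.lower(), spoken.lower()))
```

In this sketch a device is only controllable by speech once initialize_device has run; before that, translate returns None for its commands, which mirrors the state in which the device "cannot be controlled by speech".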

The command list sent from the device to the speech unit
can either consist of only the orthographic form of the
spoken-commands in conjunction with the appropriate
user-network-command or it additionally provides the
pronunciation, e.g. phonemic transcriptions, for these
spoken-commands. The speech unit's present vocabulary is
then extended with these new user-network-commands. In
case the user-network-command list only gave the orthography
of the spoken-commands but not the transcriptions, a
built-in grapheme-to-phoneme conversion section 7 generates
the pronunciations and their variations and thus completes
the user-network-command list. After updating the
vocabulary and grammars the new device can be fully
controlled by speech.

If such a handshake procedure in-between a newly connected
device and the speech unit is not performed, only a
basic functionality of the device is provided by some
user-network-commands stored in the initial vocabulary
contained in the speech unit that matches to the
user-network-commands of said device. It is also possible that
user-network-commands used for other devices can be adapted to
the new device, but the full controllability by speech cannot
be guaranteed in this way. Still the speech unit has to know
the ID of said device to have an access, so some kind of
communication in-between the speech unit and the device or
another device knowing the ID has to take place.

Commands that include media descriptions, e.g. the name
of a CD, song titles, movie titles, or station names induce
vocabularies that are in part unknown to the speech unit.
Hence, this information has to be acquired from other
sources. Current state of the art is that the user enters this
information by typing or spelling. The speech unit according
to the invention, on the other hand, can dynamically create
the vocabulary and/or grammars similar to the processes as
described above. The name and/or pronunciation of a media
description or program name is acquired in one of the
following ways:

From a database delivered by someone on some media,
e.g. CD-ROM;

the medium, e.g. CD, Digital Video Broadcast (DVB),
itself holds its description and optionally also the
pronunciation of its description, e.g. artists names and
song titles are machine readable included on a CD;

from a database accessed over an information transport
mechanism, e.g. the internet, Digital Audio Broadcast
(DAB), a home network, telephone lines. Besides these
methods the user might enter it by typing or spelling.

To acquire such information, the speech unit or any other
device issues an information seeking request asking for the
content of a medium or a program, e.g., when a new CD is
inserted in a player for the first time, when a device capable
of receiving programs is attached to the bus, to other
connected devices, e.g. all devices of the home network.
There might be more than one device that tries to answer the
request. Possible devices might be for example:

A device capable of reading a delivered database, e.g. a
CD-ROM contained in a disc jukebox, a video tape
player telling the content of the tape, a data base that
has been entered by the user, e.g. on his PC, a set top
box telling the channels, i.e. program names, it can
receive;

a device connected to another information transport
mechanism, e.g. a WEB-TV, a set-top-box, a DAB
receiver, a PC, that at least sometimes is connected to
the internet or has a modem to connect to a system
holding program channel information or other
information;

a device communicating with the user that queries a
content, e.g. by asking him/her how a frequently played
song is called, what program he is currently watching,
etc., a dialogue initiated by the user about a newly
bought media and the wish to enter the titles by typing,
spelling or speaking.

FIG. 12 shows an example for the interaction of multiple
devices for vocabulary extensions concerning media
contents. Information is delivered by neither the speech unit
nor the device holding the media in this case. After a new
medium is inserted for the first time in a media player, the
media player sends a notification of insertion of medium X
to the speech unit. The speech unit then sends a content
query for medium X in form of a control-network-command
to the media player and to all other connected network
devices. One of the other connected network devices sends
thereafter the content information for medium X to the
speech unit. The speech unit updates its vocabulary and
sends an acknowledge receipt to the media player and the
other connected network device that has sent the content
information for medium X. The medium content can now be
accessed by spoken-commands, e.g. play Tschaikowsky
Piano Concert b-minor.

FIG. 13 shows another example for the interaction of
multiple devices for vocabulary extension concerning media
contents. In this case two devices answer the query. The first
answer is chosen to update the vocabulary while the second
answer is discarded.

After a new medium is inserted for the first time in a
media player, the media player sends a notification of
insertion of medium X in form of a control-network-command
to the speech unit. The speech unit sends then a
content query for medium X to the media player and all
other connected network devices. In this case the media
player sends the content information for medium X, since
the content description is entailed on the medium. The
speech unit then updates its vocabulary and/or grammars
and sends an acknowledge receipt in form of a
control-network-command to the media player. If the content
information for medium X is thereafter delivered by another
connected network device, the speech unit discards this
information.

It might also be possible that a database delivered on some
medium, e.g. a CD-ROM, or a database stored in the
internet, i.e. an internet page, or transmitted via digital
broadcasting contains the user-network-commands and
corresponding vocabulary and/or grammars of a remotely
controllable network device, in this case this information can be
downloaded by the speech unit 2 like the media descriptions,
e.g. when a new device 11 is connected to the network or
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when a user initiates a vocabulary update. Such devices need 
not to carry this information in a memory, but it can be 
delivered with the device 11 on a data carrier that can be read 
by another device 11 connected to the network or it can be 
supplied by the manufacturer of the device via the internet 
or digital broadcasting. 
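
The content query exchange of FIG. 12 and FIG. 13 (query sent to all devices, first answer used, later answers discarded) can be sketched as follows. This is only an illustration under stated assumptions: the names MediaVocabulary, on_content_info and query_content, and the modelling of each device as a lookup function, are hypothetical and are not taken from the description.

```python
class MediaVocabulary:
    """Hypothetical store for media-dependent vocabulary in the speech unit."""

    def __init__(self):
        self.titles = {}   # medium id -> list of titles added to the vocabulary
        self.sources = {}  # medium id -> name of the device that answered first

    def on_content_info(self, medium_id, device_name, titles):
        """Return True if the answer was used, False if it was discarded."""
        if medium_id in self.titles:
            # An earlier answer already updated the vocabulary (FIG. 13 case).
            return False
        self.titles[medium_id] = list(titles)
        self.sources[medium_id] = device_name
        return True


def query_content(vocab, medium_id, devices):
    """Send the content query to every device and collect acknowledgements.

    `devices` is a list of (name, lookup) pairs; lookup(medium_id) returns a
    list of titles, or None if the device has no information on the medium.
    """
    acknowledged = []
    for name, lookup in devices:
        titles = lookup(medium_id)
        if titles is None:
            continue  # this device cannot answer the query
        if vocab.on_content_info(medium_id, name, titles):
            acknowledged.append(name)  # only the used answer is acknowledged
    return acknowledged
```

Iterating over the device list stands in for the arrival order of answers on the bus; on a real network the "first" answer would be whichever message reaches the speech unit first.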
We claim: 

1. Speech unit for generating user-network-commands
according to electric signals provided by a microphone to
control a remotely controllable device connected to said
speech unit, characterized by;

a control unit in said speech unit to send
control-network-commands to said device connected to said speech unit
so that said device transmits device or medium dependent
vocabulary and/or grammars and corresponding
user-network-commands to said speech unit and to
receive data and messages from said device; and

a memory to store said device or medium dependent
vocabulary and/or grammars and corresponding
user-network-commands transmitted by said remotely
controllable device connected to said speech unit,

said speech unit generates and stores in said memory, new
user-network-commands and corresponding vocabulary
to control said remotely controllable device based
on a dialogue between said speech unit and a user.

2. Speech unit according to claim 1, characterized by an
interface connected to a network system to which a remotely
controllable device can be connected that can be controlled
via said network system, to send generated
user-network-commands and control-network-commands via said network
system to said remotely controllable device and to receive
data and messages from said remotely controllable device.

3. Speech unit according to claim 2, characterized in that
said control unit determines what kind of devices are
connected to a network system, to send said
control-network-commands to said devices, and to receive data from said
devices.

4. Speech unit according to claim 1, characterized in that
said device is wired or wireless connected to said speech
unit.

5. Speech unit according to claim 1, characterized by a
memory to initially store general vocabulary and grammars
based on which general user-network-commands are
generated.

6. Speech unit according to claim 1, characterized by a
speaker recognition section to identify different users
according to said electric signals provided by said
microphone to be able to generate speaker dependent
user-network-commands.

7. Speech unit according to claim 1, characterized by a
speech synthesizer to synthesize messages from said device
and from said speech unit itself and to output them to a user
via a loudspeaker.

8. Speech unit according to claim 1, characterized by a
microphone and/or a loudspeaker.

9. Speech unit according to claim 1, characterized in that
said microphone and/or a loudspeaker are/is remotely
connected to said speech unit either wired or wireless either
direct or via a network.

10. Speech unit for generating user-network-commands
according to electric signals provided by a microphone to
control a remotely controllable device characterized by an
interface connected to a network system to which said
remotely controllable device is connected that can be
controlled via said network system, to send generated
user-network-commands via said network system to said
remotely controllable device and to receive data and
message from said remotely controllable device, said
user-network-commands are requested by
control-network-commands from a control unit in said speech unit, said
speech unit generates and stores in a memory, new
user-network-commands and corresponding vocabulary to
control said remotely controllable device based on a dialogue
between said speech unit and a user.

11. Speech unit according to claim 10, characterized in
that said interface is connected to said network system via a
public telephone network.

12. Speech unit according to claim 10, characterized in
that said interface is connected to said network system via
another network system.

13. Remotely controllable device, comprising: 

a first control unit to extract user-network-commands
directed to said device and to control the functionality 
of said remotely controllable device according to said 
extracted user-network-commands, characterized in 
that said first control unit also extracts control-network- 
commands directed to said remotely controllable 
device, said control-network-commands are sent by a 
second control unit in a speech unit, and, according to 
said extracted control-network-commands, controls the 
transmission of device dependent user-network- 
commands and corresponding vocabulary and/or gram- 
mars stored in a memory of said remotely controllable 
device useable by said speech unit connected thereto to 
convert spoken-commands from a user into user- 
network-commands to control the functionality of said 
remotely controllable device, said speech unit gener- 
ates and stores in said memory, new user-network- 
commands and corresponding vocabulary to control 
said remotely controllable device based on a dialogue 
between said speech unit and said user. 

14. Remotely controllable device according to claim 13, 
characterized by an interface connected to a network system 
to which said device and said speech unit can be connected 
to receive and transmit commands, data and messages. 

15. Remotely controllable device, comprising: 

a first control unit also extracts control-network- 
commands directed to said device and to control the 
functionality of said remotely controllable device 
according to said extracted user-network-commands, 
characterized in that said first control unit also extracts 
control-network-commands directed to said remotely 
controllable device, said control-network-commands 
are sent by a second control unit in a speech unit, and, 
according to said extracted control-network- 
commands, controls the transmission of medium 
dependent user-network-commands and corresponding
vocabulary and/or grammars stored on a medium 
accessable by said remotely controllable device useable 
by said speech unit connected thereto to convert 
spoken-commands from a user into a user-network- 
commands to control the functionality of said remotely 
controllable device in regard to said accessable medium 
or to control the functionality of said or another 
remotely controlled device, said speech unit generates 
and stores in a memory, new user-network-commands
and corresponding vocabulary to control said remotely 
controllable device based on a dialogue between said 
speech unit and said user. 

16. Remotely controllable device according to claim 15, 
characterized in that said medium accessable by said 
remotely controllable device is a compact disc. 

17. Remotely controllable device according to claim 15, 
characterized in that said medium accessable by said 
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remotely controllable device is an internet page or informa- 
tion page transmitted via digital broadcasting. 

18. Method of self-initialization of a speech unit connected
to a remotely controllable device, comprising the
following steps:

a) send a control-network-command from a control unit in
said speech unit to said remotely controllable device to
control said device to transmit device or medium
dependent user-network-commands to control said
device and the corresponding vocabulary and/or grammars;

b) receive said device or medium dependent
user-network-commands and the corresponding vocabulary
and/or grammars from said device;

c) update vocabulary and/or grammars and the corresponding
user-network-commands in a memory; and

d) generate and store in said memory new vocabulary
and/or grammars and corresponding user-network-commands
based on a dialogue between said speech
unit and a user.

19. Method according to claim 18, characterized by the
following steps:

ask for a logical name or identifier of said device offering
the device dependent user-network-commands and the
corresponding vocabulary and/or grammars;

receive logical name or identifier; and
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assign vocabulary and grammars and corresponding user- 
network-commands for said device to the received 
logical name or identifier when said vocabulary and/or 
grammars and the corresponding user-network- 
commands are updated in said memory in order to 
create device dependent user-network-commands. 

20. Method according to claim 19, characterized in that 
said logical name of said device is either determined by a 
user or by said device itself. 

21. Method according to claim 19, characterized in that 
said identifier includes address and name of said device. 

22. Method according to claim 18, characterized by the 
following steps: 

send a control-network-command to identify a user 
dependent mapping for the vocabulary and/or gram- 
mars and corresponding user-network-commands; 

receiving at least one of a name, an identifier, and a speech 
sample of a user that the dependency should be created 
for; and assign the vocabulary and/or grammars and 
corresponding user-network-commands for said device 
to a received name or names, an identifier or identifiers 
or speech sample of said user when said vocabulary 
and/or grammars and the corresponding user-network- 
commands are updated in said memory in order to 
create user dependent user-network-commands.

* * * * * 
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